========================================================
This dataset contains 113,937 loans with 81 variables for each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The data set was last updated on March 11, 2014.
## [1] 113937 81
In this analysis, we will limit the amount of variables we analyze to 13 rather than 81 variables. The variables below were selected out of an interest to see any relationships and trends between them.
## [1] 113937 13
## 'data.frame': 113937 obs. of 13 variables:
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CreditScoreRangeLower: int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper: int 659 699 499 819 699 759 699 719 839 839 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
I will try to order the factor levels of loan status and employment status by perceived level of benefit and risk for the loan company (from worst to best). I understand this may be subjective, but I want to see how the distributions and plots display with these variables ordered.
## [1] "ChargedOff" "Defaulted"
## [3] "Cancelled" "Past Due (>120 days)"
## [5] "Past Due (91-120 days)" "Past Due (61-90 days)"
## [7] "Past Due (31-60 days)" "Past Due (16-30 days)"
## [9] "Past Due (1-15 days)" "FinalPaymentInProgress"
## [11] "Current" "Completed"
## [1] "Not employed" "No answer" "Not available" "Other"
## [5] "Retired" "Self-employed" "Part-time" "Employed"
## [9] "Full-time"
## Term LoanStatus BorrowerRate
## Min. :12.00 Current :56576 Min. :0.0000
## 1st Qu.:36.00 Completed :38074 1st Qu.:0.1340
## Median :36.00 Defaulted : 5018 Median :0.1840
## Mean :40.83 Past Due (1-15 days) : 806 Mean :0.1928
## 3rd Qu.:36.00 Past Due (31-60 days): 363 3rd Qu.:0.2500
## Max. :60.00 (Other) : 1108 Max. :0.4975
## NA's :11992
## ProsperScore EmploymentStatus IsBorrowerHomeowner
## Min. : 1.00 Employed :67322 False:56459
## 1st Qu.: 4.00 Full-time :26355 True :57478
## Median : 6.00 Self-employed: 6134
## Mean : 5.95 Not available: 5347
## 3rd Qu.: 8.00 Other : 3806
## Max. :11.00 No answer : 2255
## NA's :29084 (Other) : 2718
## CreditScoreRangeLower CreditScoreRangeUpper OpenCreditLines
## Min. : 0.0 Min. : 19.0 Min. : 0.00
## 1st Qu.:660.0 1st Qu.:679.0 1st Qu.: 6.00
## Median :680.0 Median :699.0 Median : 9.00
## Mean :685.6 Mean :704.6 Mean : 9.26
## 3rd Qu.:720.0 3rd Qu.:739.0 3rd Qu.:12.00
## Max. :880.0 Max. :899.0 Max. :54.00
## NA's :591 NA's :591 NA's :7604
## AmountDelinquent DebtToIncomeRatio StatedMonthlyIncome
## Min. : 0.0 Min. : 0.000 Min. : 0
## 1st Qu.: 0.0 1st Qu.: 0.140 1st Qu.: 3200
## Median : 0.0 Median : 0.220 Median : 4667
## Mean : 984.5 Mean : 0.276 Mean : 5608
## 3rd Qu.: 0.0 3rd Qu.: 0.320 3rd Qu.: 6825
## Max. :463881.0 Max. :10.010 Max. :1750003
## NA's :7622 NA's :8554
## MonthlyLoanPayment
## Min. : 0.0
## 1st Qu.: 131.6
## Median : 217.7
## Mean : 272.5
## 3rd Qu.: 371.6
## Max. :2251.5
##
##
## 12 36 60
## 1614 87778 24545
There are only three distinct values to the loan term (12, 36, and 60 months). Going forward, we will treat the Term variable as a factor due to its categorical nature.
We can clearly see that most loans have a term of 36 months, and it’s much less common for a loan to have a term of 12 months.
##
## ChargedOff Defaulted Cancelled
## 0 5018 5
## Past Due (>120 days) Past Due (91-120 days) Past Due (61-90 days)
## 16 304 313
## Past Due (31-60 days) Past Due (16-30 days) Past Due (1-15 days)
## 363 265 806
## FinalPaymentInProgress Current Completed
## 205 56576 38074
Most of the loans are either Current or Completed, but there are also many that are Charged off or Defaulted. Of the loans that are Past Due, the highest number are past due by 1-15 days. There are hardly any loans that are categorized as > 120 days past due. My theory is that this is due to the loans being considered Charged off shortly after they become > 120 days past due.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2251.5
Monthly loan payment is skewed to the right, with a minimum of $0 and a maximum of $2251.50. The minimum of $0 is interesting, as I expected all monthly loan payments to be greater than $0. I want to take a closer look at loans with monthly payments of $0 and those with high monthly payments.
## Term LoanStatus BorrowerRate ProsperScore
## 12:161 Completed :800 Min. :0.0500 Min. : 1.000
## 36:774 Defaulted :131 1st Qu.:0.1224 1st Qu.: 6.000
## 60: 0 FinalPaymentInProgress: 4 Median :0.1875 Median : 8.000
## ChargedOff : 0 Mean :0.1943 Mean : 7.605
## Cancelled : 0 3rd Qu.:0.2635 3rd Qu.: 9.000
## Past Due (>120 days) : 0 Max. :0.3600 Max. :10.000
## (Other) : 0 NA's :451
## EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
## Full-time :444 False:558 Min. : 0.0
## Employed :269 True :377 1st Qu.:580.0
## Not available: 79 Median :660.0
## No answer : 39 Mean :652.7
## Self-employed: 35 3rd Qu.:720.0
## Part-time : 35 Max. :860.0
## (Other) : 34 NA's :11
## CreditScoreRangeUpper OpenCreditLines AmountDelinquent
## Min. : 19.0 Min. : 0.000 Min. : 0.0
## 1st Qu.:599.0 1st Qu.: 4.000 1st Qu.: 0.0
## Median :679.0 Median : 7.000 Median : 0.0
## Mean :671.7 Mean : 7.543 Mean : 878.4
## 3rd Qu.:739.0 3rd Qu.:10.000 3rd Qu.: 20.0
## Max. :879.0 Max. :27.000 Max. :81723.0
## NA's :11 NA's :118 NA's :118
## DebtToIncomeRatio StatedMonthlyIncome MonthlyLoanPayment
## Min. : 0.0100 Min. : 0 Min. :0
## 1st Qu.: 0.1000 1st Qu.: 2750 1st Qu.:0
## Median : 0.1700 Median : 4110 Median :0
## Mean : 0.2441 Mean : 4836 Mean :0
## 3rd Qu.: 0.2800 3rd Qu.: 5944 3rd Qu.:0
## Max. :10.0100 Max. :108333 Max. :0
## NA's :64
Subsetting the data to view loans with monthly payments of $0, we can see that categories such as CreditScoreRangeLower and StatedMonthlyIncome are lower than the overall median. Interestingly, the ProsperScore is higher than average for this subset.
## Term LoanStatus BorrowerRate ProsperScore
## 12:8 Completed :7 Min. :0.05650 Min. : 3.00
## 36:0 FinalPaymentInProgress:1 1st Qu.:0.06248 1st Qu.: 7.25
## 60:0 ChargedOff :0 Median :0.07715 Median : 8.00
## Defaulted :0 Mean :0.09001 Mean : 7.50
## Cancelled :0 3rd Qu.:0.11865 3rd Qu.: 8.50
## Past Due (>120 days) :0 Max. :0.14580 Max. :10.00
## (Other) :0
## EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
## Employed :8 False:4 Min. :740.0
## Not employed :0 True :4 1st Qu.:755.0
## No answer :0 Median :760.0
## Not available:0 Mean :767.5
## Other :0 3rd Qu.:780.0
## Retired :0 Max. :820.0
## (Other) :0
## CreditScoreRangeUpper OpenCreditLines AmountDelinquent DebtToIncomeRatio
## Min. :759.0 Min. : 6.00 Min. :0 Min. :0.1100
## 1st Qu.:774.0 1st Qu.: 7.75 1st Qu.:0 1st Qu.:0.1625
## Median :779.0 Median : 9.50 Median :0 Median :0.1800
## Mean :786.5 Mean :11.00 Mean :0 Mean :0.2137
## 3rd Qu.:799.0 3rd Qu.:13.25 3rd Qu.:0 3rd Qu.:0.2550
## Max. :839.0 Max. :21.00 Max. :0 Max. :0.3900
##
## StatedMonthlyIncome MonthlyLoanPayment
## Min. : 7292 Min. :2112
## 1st Qu.:12333 1st Qu.:2144
## Median :18125 Median :2159
## Mean :16589 Mean :2170
## 3rd Qu.:20208 3rd Qu.:2189
## Max. :25000 Max. :2252
##
Subsetting the data to view loans with monthly payments over $2000, it seems that the borrower generally has a high stated monthly income, has no amount delinquent, is employed, and has a high credit score. Also, the terms are all 12 months and the borrower rates are all under 0.15. The borrowers in these loans appear to be “low risk” borrowers as the higher median Prosper Score indicates.
The Prosper Score appears to be a normal distribution. The plot above excludes any ‘NA’ values. The Prosper Score is also supposed to range from 1-10, so the scores of 11 were unexpected. I’m hesitant to treat a score of 11 as the best score since it may be an entry error. Let us subset the data to see if there is a trend in the loans with a Prosper Score of 11 and if the data is trustworthy.
## Term LoanStatus BorrowerRate ProsperScore
## 12: 0 Current :1419 Min. :0.06050 Min. :11
## 36:1049 Completed : 31 1st Qu.:0.06590 1st Qu.:11
## 60: 407 FinalPaymentInProgress: 3 Median :0.08690 Median :11
## Defaulted : 1 Mean :0.09328 Mean :11
## Past Due (61-90 days) : 1 3rd Qu.:0.10990 3rd Qu.:11
## Past Due (1-15 days) : 1 Max. :0.19500 Max. :11
## (Other) : 0
## EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
## Employed :1365 False:483 Min. :640
## Self-employed: 43 True :973 1st Qu.:700
## Other : 34 Median :720
## Full-time : 14 Mean :732
## Not employed : 0 3rd Qu.:760
## No answer : 0 Max. :840
## (Other) : 0
## CreditScoreRangeUpper OpenCreditLines AmountDelinquent DebtToIncomeRatio
## Min. :659 Min. : 2.00 Min. : 0 Min. :0.0200
## 1st Qu.:719 1st Qu.: 8.00 1st Qu.: 0 1st Qu.:0.1300
## Median :739 Median :11.00 Median : 0 Median :0.1900
## Mean :751 Mean :11.32 Mean : 298 Mean :0.2007
## 3rd Qu.:779 3rd Qu.:14.00 3rd Qu.: 0 3rd Qu.:0.2500
## Max. :859 Max. :38.00 Max. :106071 Max. :0.5800
## NA's :44
## StatedMonthlyIncome MonthlyLoanPayment
## Min. : 1200 Min. : 43.47
## 1st Qu.: 5417 1st Qu.: 243.56
## Median : 7500 Median : 402.31
## Mean : 8850 Mean : 424.04
## 3rd Qu.:10833 3rd Qu.: 581.40
## Max. :76667 Max. :1162.33
##
The summary of loans with a Prosper Score of 11 do seem to indicate borrowers that are low risk. The median StatedMonthlyIncome and OpenCreditLines are higher than the overall median. I think this data can be trusted and used.
Borrower Rate has a fairly normal distribution, but there is a second peak around 0.31. The Prosper Score and Borrower Rate are both generated values, and I’m interested in exploring which variables affect these values.
Almost all of the borrowers are employed. A limitation of this dataset that can be recognized here is that the data is only for those who applied for a loan and were approved. The “riskier”" loan applicants (unemployed) that were denied loans are not a part of this dataset.
I also wonder if leaving the response blank, or the unclear responses of “Not available” and “Other” affect the borrower’s interest rate and Prosper score.
There is an even split between homeowners and non-homeowners. Does being a homeowner affect the Prosper Score? How about the borrower rate?
The stated monthly income is skewed to the right, but the log transformation produces a more normal distribution. There are very large outliers in the stated monthly income data so the top 1% of data was trimmed for the plot.
Similarly, a majority of the Debt to Income Ratios were between 0 and 1 with some values at the max of 10.01. The top 1% of data was trimmed for the plot above, showing a fairly normal distribution with some skew to the right. I want to take a look at the highest values of debt to income ratio.
## Term LoanStatus BorrowerRate ProsperScore
## 12: 1 Completed :124 Min. :0.0100 Min. :1.000
## 36:263 Defaulted : 42 1st Qu.:0.1497 1st Qu.:3.000
## 60: 8 Current : 23 Median :0.1900 Median :4.500
## Past Due (61-90 days): 1 Mean :0.2013 Mean :4.478
## Past Due (16-30 days): 1 3rd Qu.:0.2699 3rd Qu.:6.000
## (Other) : 0 Max. :0.3500 Max. :8.000
## NA's : 81 NA's :226
## EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
## Self-employed:74 False:145 Min. :460.0
## Full-time :65 True :127 1st Qu.:620.0
## Not available:37 Median :680.0
## Employed :36 Mean :668.8
## Not employed :27 3rd Qu.:720.0
## Part-time :17 Max. :860.0
## (Other) :16
## CreditScoreRangeUpper OpenCreditLines AmountDelinquent
## Min. :479.0 Min. : 0.000 Min. : 0.0
## 1st Qu.:639.0 1st Qu.: 5.000 1st Qu.: 0.0
## Median :699.0 Median : 8.000 Median : 0.0
## Mean :687.8 Mean : 8.694 Mean : 734.9
## 3rd Qu.:739.0 3rd Qu.:12.000 3rd Qu.: 0.0
## Max. :879.0 Max. :30.000 Max. :37077.0
## NA's :43 NA's :43
## DebtToIncomeRatio StatedMonthlyIncome MonthlyLoanPayment
## Min. :10.01 Min. : 0.000 Min. : 0.0
## 1st Qu.:10.01 1st Qu.: 0.083 1st Qu.: 107.2
## Median :10.01 Median : 0.083 Median : 209.5
## Mean :10.01 Mean : 133.299 Mean : 301.9
## 3rd Qu.:10.01 3rd Qu.: 0.396 3rd Qu.: 388.6
## Max. :10.01 Max. :17083.333 Max. :1047.6
##
Subsetting the data by Debt to Income Ratios greater than 10, we begin to see riskier Prosper scores, more Charged off or Defaulted loans, and much lower levels of full-time employment and stated monthly incomes.
##
## 0 360 420 440 460 480 500 520 540 560 580 600
## 133 1 5 36 141 346 554 1593 1474 1357 1125 3602
## 620 640 660 680 700 720 740 760 780 800 820 840
## 4172 12199 16366 16492 15471 12923 9267 6606 4624 2644 1409 567
## 860 880
## 212 27
##
## 19 379 439 459 479 499 519 539 559 579 599 619
## 133 1 5 36 141 346 554 1593 1474 1357 1125 3602
## 639 659 679 699 719 739 759 779 799 819 839 859
## 4172 12199 16366 16492 15471 12923 9267 6606 4624 2644 1409 567
## 879 899
## 212 27
The credit score variables both have normal distributions that look identical, but shifted on the x-axis. Looking at the tables of the values, we can see that the lower credit scores all end in 0 and the upper scores all end in 9. The scores also increment by a factor of 20 and the upper credit scores are all higher than the lower scores by exactly 19. Since these values represent a credit score range that seems to vary by exactly 19 for each borrower, we will create a new variable CreditScore that is the mean of CreditScoreRangeLower and CreditScoreRangeUpper.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 669.5 689.5 695.1 729.5 889.5 591
According to ValuePenguin, the average credit score in the U.S. in 2014 was 693. Here we can see that the mean (695) and median (690) of the CreditScore in our data set are right around this value. I’m surprised to see here that the credit scores for those approved for loans is not higher than the average in the U.S. in 2014.
We can also see here that credit scores in our data set range from 9.5 to 889.5, but the actual range for credit scores is 300 to 850. Let’s explore some loans with credit scores outside the normal range.
## Term LoanStatus BorrowerRate ProsperScore EmploymentStatus
## 795 36 Completed 0.2900 NA Not available
## 913 36 Defaulted 0.2600 NA Not available
## 1686 36 <NA> 0.1625 NA Not available
## 4291 36 <NA> 0.2375 NA No answer
## 4372 36 Defaulted 0.2325 NA Not available
## 4431 36 Defaulted 0.1450 NA Not available
## 4673 36 <NA> 0.1930 NA Not available
## 5685 36 Completed 0.2300 NA Not available
## 6878 36 <NA> 0.0700 NA Not available
## 7740 36 <NA> 0.2900 NA Not available
## 9049 36 Defaulted 0.2900 NA Not available
## 9474 36 <NA> 0.2500 NA Not available
## 13396 36 Defaulted 0.2375 NA No answer
## 13526 36 <NA> 0.2000 NA Not available
## 13687 36 Completed 0.1700 NA Not available
## 13939 36 Defaulted 0.1700 NA Not available
## 14339 36 <NA> 0.1700 NA Not available
## 14726 36 <NA> 0.1700 NA Not available
## 15094 36 Defaulted 0.2900 NA No answer
## 15138 36 <NA> 0.2500 NA No answer
## IsBorrowerHomeowner CreditScoreRangeLower CreditScoreRangeUpper
## 795 False 0 19
## 913 False 0 19
## 1686 False 0 19
## 4291 False 0 19
## 4372 False 0 19
## 4431 False 0 19
## 4673 False 0 19
## 5685 False 0 19
## 6878 False 0 19
## 7740 False 0 19
## 9049 False 0 19
## 9474 False 0 19
## 13396 True 0 19
## 13526 False 0 19
## 13687 False 0 19
## 13939 False 0 19
## 14339 False 0 19
## 14726 False 0 19
## 15094 False 0 19
## 15138 False 0 19
## OpenCreditLines AmountDelinquent DebtToIncomeRatio
## 795 NA NA NA
## 913 NA NA 0.03000
## 1686 NA NA NA
## 4291 NA NA 0.10368
## 4372 NA NA NA
## 4431 NA NA NA
## 4673 NA NA 0.02000
## 5685 NA NA 0.09000
## 6878 NA NA NA
## 7740 NA NA 0.07000
## 9049 NA NA 0.01000
## 9474 NA NA 0.05000
## 13396 NA NA 0.02393
## 13526 NA NA NA
## 13687 NA NA NA
## 13939 NA NA 0.02000
## 14339 NA NA NA
## 14726 NA NA 0.13000
## 15094 NA NA 0.14000
## 15138 NA NA NA
## StatedMonthlyIncome MonthlyLoanPayment CreditScore
## 795 115.5833 167.62 9.5
## 913 2316.6667 60.44 9.5
## 1686 1600.0000 35.28 9.5
## 4291 2916.6667 156.41 9.5
## 4372 0.0000 58.26 9.5
## 4431 166.6667 34.42 9.5
## 4673 2666.6667 44.17 9.5
## 5685 1083.3333 98.71 9.5
## 6878 2227.5000 30.88 9.5
## 7740 1400.0000 104.76 9.5
## 9049 3166.6667 41.91 9.5
## 9474 2000.0000 101.39 9.5
## 13396 4166.6667 99.71 9.5
## 13526 3896.6667 111.49 9.5
## 13687 2050.0000 35.65 9.5
## 13939 2083.3333 35.65 9.5
## 14339 1083.3333 42.78 9.5
## 14726 916.6667 57.04 9.5
## 15094 1200.0000 83.81 9.5
## 15138 1500.0000 119.28 9.5
## Term LoanStatus BorrowerRate ProsperScore EmploymentStatus
## 186 36 Defaulted 0.2365 6 Self-employed
## 936 36 Completed 0.0499 NA Full-time
## 1094 36 Current 0.0759 8 Employed
## 1402 36 Completed 0.0820 NA Not available
## 1576 36 Current 0.0625 10 Employed
## 2024 36 Completed 0.3199 3 Self-employed
## 2117 36 Completed 0.0750 NA Self-employed
## 2987 36 Completed 0.0695 NA Retired
## 4083 36 Completed 0.0945 8 Full-time
## 4550 36 Completed 0.0854 10 Employed
## 6521 60 Current 0.0779 10 Employed
## 7183 36 Completed 0.0590 NA Full-time
## 7449 12 Completed 0.0628 9 Retired
## 7475 36 Completed 0.1390 8 Employed
## 7835 36 Current 0.0649 10 Employed
## 8075 36 Completed 0.0830 8 Full-time
## 8731 36 Completed 0.0599 10 Employed
## 8852 36 Completed 0.0825 10 Employed
## 8993 36 Current 0.2049 7 Employed
## 10160 36 Completed 0.1559 5 Self-employed
## IsBorrowerHomeowner CreditScoreRangeLower CreditScoreRangeUpper
## 186 True 860 879
## 936 True 860 879
## 1094 True 860 879
## 1402 True 860 879
## 1576 True 860 879
## 2024 True 860 879
## 2117 True 860 879
## 2987 True 860 879
## 4083 True 860 879
## 4550 True 860 879
## 6521 True 860 879
## 7183 True 860 879
## 7449 True 860 879
## 7475 True 860 879
## 7835 True 860 879
## 8075 True 880 899
## 8731 True 860 879
## 8852 True 860 879
## 8993 True 880 899
## 10160 True 860 879
## OpenCreditLines AmountDelinquent DebtToIncomeRatio
## 186 5 0 NA
## 936 4 0 0.00
## 1094 6 0 0.17
## 1402 NA NA 0.02
## 1576 8 0 0.03
## 2024 7 0 NA
## 2117 7 0 0.03
## 2987 8 0 0.05
## 4083 12 0 0.15
## 4550 4 0 0.02
## 6521 7 0 0.08
## 7183 15 0 0.14
## 7449 13 0 0.13
## 7475 14 0 0.13
## 7835 7 0 0.06
## 8075 8 0 NA
## 8731 7 0 0.05
## 8852 5 0 0.03
## 8993 7 0 0.26
## 10160 10 0 NA
## StatedMonthlyIncome MonthlyLoanPayment CreditScore
## 186 3333.333 234.30 869.5
## 936 9000.000 29.97 869.5
## 1094 6833.333 186.89 869.5
## 1402 5000.000 94.29 869.5
## 1576 17916.667 213.75 869.5
## 2024 6250.000 108.87 869.5
## 2117 18333.333 31.11 869.5
## 2987 1258.000 46.28 869.5
## 4083 10583.333 640.19 869.5
## 4550 8108.333 94.76 869.5
## 6521 8666.667 282.46 869.5
## 7183 7500.000 121.51 869.5
## 7449 10000.000 172.39 869.5
## 7475 32083.333 563.13 869.5
## 7835 8333.333 245.16 869.5
## 8075 6250.000 157.37 889.5
## 8731 7291.667 243.34 869.5
## 8852 9333.333 235.89 869.5
## 8993 2333.333 149.65 889.5
## 10160 10000.000 524.32 869.5
These loans do seem to be missing a fair amount of values (there are a lot of Prosper scores missing), but since the data here do contain values for many of the variables I will not remove these rows. I will remember to subset the data for any plots or calculations involving credit score to only include scores from 300 to 850.
The number of open credit lines is a fairly normal distribution. Most people have less than 20 open credit lines. The distribution of amount delinquent is, alternatively, clearly skewed to the right. Most people are not deliquent at all. A log transformation of the AmountDelinquent data produces a normal distribution.
The data set being analyzed contains 113,937 loans with 13 variables (Term, LoanStatus, MonthlyLoanPayment, ProsperScore, BorrowerRate, EmploymentStatus, IsBorrowerHomeowner, StatedMonthlyIncome, DebtToIncomeRatio, CreditScoreRangeLower, CreditScoreRangeUpper, OpenCreditLines, and AmountDelinquent).
The original data set contained 81 variables, but in this project we are focusing on 13 variables of interest.
Some initial observations from the data are: - Most loans have a 36 month term - Borrower rate peaks around 0.15, but also again around 0.31 - The median upper credit score is about the same as the average U.S credit score - There are extreme values in the debt to income ratio at 10.01 - Some monthly loan payments are much higher than average - Some Prosper scores are an 11, outside the scale of 1-10
The main feature of interest is ProsperScore. I want to explore how the borrower profile such as EmploymentStatus, CreditScore, and StatedMonthlyIncome affect the ProsperScore. Also, I want to see how ProsperScore relates to the loan details such as BorrowerRate, MonthlyLoanPayment, and LoanStatus.
As stated above, I think the variables relating to the borrower’s profile will all contribute to the ProsperScore. I think the CreditScore, StatedMonthlyIncome, and DebtToIncomeRatio will contribute the most. Further, I think the ProsperScore will correlate with the BorrowerRate and MonthlyLoanPayment, and will be predictive of the LoanStatus.
I created CreditScore by taking the average of the upper and lower credit scores, and I made Term a factor variable since the numerical values fell within 3 distinct categories. I also ordered the EmploymentStatus and LoanStatus factors based on my perception of best to worst from the view of the loan company.
I performed a log transformation on the StatedMonthlyIncome and AmountDelinquent since the data in both plots was skewed to the right. I wanted to see if the log transformation would normalize the data. The transformed data did create a more normal looking distribution for both variables.
## BorrowerRate ProsperScore OpenCreditLines
## BorrowerRate 1.00 -0.66 -0.11
## ProsperScore -0.66 1.00 -0.01
## OpenCreditLines -0.11 -0.01 1.00
## AmountDelinquent 0.05 -0.04 -0.06
## DebtToIncomeRatio 0.13 -0.14 0.15
## StatedMonthlyIncome -0.15 0.16 0.19
## MonthlyLoanPayment -0.33 0.18 0.19
## CreditScore -0.53 0.38 0.10
## AmountDelinquent DebtToIncomeRatio StatedMonthlyIncome
## BorrowerRate 0.05 0.13 -0.15
## ProsperScore -0.04 -0.14 0.16
## OpenCreditLines -0.06 0.15 0.19
## AmountDelinquent 1.00 -0.03 0.02
## DebtToIncomeRatio -0.03 1.00 -0.14
## StatedMonthlyIncome 0.02 -0.14 1.00
## MonthlyLoanPayment -0.02 0.00 0.30
## CreditScore -0.05 -0.01 0.10
## MonthlyLoanPayment CreditScore
## BorrowerRate -0.33 -0.53
## ProsperScore 0.18 0.38
## OpenCreditLines 0.19 0.10
## AmountDelinquent -0.02 -0.05
## DebtToIncomeRatio 0.00 -0.01
## StatedMonthlyIncome 0.30 0.10
## MonthlyLoanPayment 1.00 0.22
## CreditScore 0.22 1.00
Borrower rate has a moderate (to strong) negative correlation with Prosper score and credit score. There are also weak relationships between some of the other variables. As credit score goes up, Prosper score tends to go up. A higher monthly loan payment also weakly correlates with a higher stated monthly income and a lower borrower rate. Borrower rate seems to have moderately strong correlations with the most variables.
The Prosper score seems to be moderately correlated with borrower rate (negatively) and credit score. Credit score also seems to be moderately correlated with monthly loan payment and negatively with borrower rate.
There don’t seem to be any obvious relationships between Prosper score and any of the categorical variables. I also included borrower rate to see if any of the categorical variables correlated with it, but the relationships are weak.
I want to take a closer look at the relationships between Prosper score and the variables borrower rate, monthly loan payment, credit score, and debt to income ratio.
As Prosper score increases, borrower rate tends to decrease. At the low end and high end of the x-axis the range in borrower rate is lower. In the middle of the plot (Prosper score of 4-8) we can see a much bigger range in the borrower rate.
The slight upward trend in monthly loan payment as Prosper score increases can be seen here. It’s also fairly clear that as Prosper score increases, the biggest change to monthly loan payments is a wider range and greater volume of values. We don’t see many monthly loan payments greater than $500 for Prosper scores of 1-3, but the distribution of monthly loan payments for Prosper scores greater than 3 do not differ much as the score increases. Basically, borrowers with low Prosper scores do not get loans with high monthly payments.
There is a clear trend here showing that a higher credit score corresponds with higher Prosper scores.
In both the scatterplot and boxplot above we can see the slight negative correlation between debt to income ratio and Prosper score. Both plots removed the top 1% of debt to income ratios.
Let us also take a look at borrower rate vs. monthly loan payment and credit score as these variables had moderate correlations with borrower rate.
Changing the plot limits and adding transparency reveals clear sections of the plot where borrower rate increases linearly as the monthly loan payment increases. Also, the highest borrower rates tend to correspond to lower monthly payments. I wonder what may cause the borrower rate to increase without much or any increase to monthly loan payment.
The correlation between credit score and borrower rate is clear here - as credit score increases, borrower rate decreases.
Next, I’ll look at how Prosper score and borrower rate may fluctuate based on the categorical variables (loan status, employment, and homeowners). It is important to remember here that the loan status and employment status levels were ordered by my own thoughts on levels of borrower risk and ideal loan outcomes.
As mentioned above, it’s important to remember that the order of these levels was determined by my own judgment. There is a slight upward trend in Prosper score across loan status, but the defaulted loans Prosper score values seem to throw off this trend. There also may be a slight downward trend in borrower rate across loan status, but it is not strong at all.
Lower average Prosper scores can be seen for borrowers with employment statuses of not employed, other, and self-employed, while higher average scores correspond with retired, part-time and full-time. Based on this, I will begin treating employment status as a nominal variable. Trying to order the employment status categories does not seem feasible any longer.
The borrower rate is generally higher for those that are not employed, but otherwise there isn’t much difference across employment levels.
The borrower’s status as a homeowner does not appear to affect the Prosper score at all, but not owning a home does seem to correspond with a higher borrower rate.
The Prosper score and borrower rate had a moderate negative correlation, with the borrower rate decreasing as Prosper score increased. There was a big range of values in the middle section of Prosper scores (and the middle section of borrower rates).
Low Prosper scores make it highly unlikely that the monthly loan payment will be greater than $500.
A well-below average credit score generally corresponds to a low Prosper score, and a well-above average credit score generally corresponds to a high Prosper score. A credit score between 650-750, on the other hand, corresponds to Prosper scores that can range vastly.
Debt to income ratio increases tended to weakly correlate with lower Prosper scores.
There was a slight upward trend in Prosper score across loan status, but the defaulted loans Prosper score values seem to throw off this trend.
Lower average Prosper scores can be seen for borrowers with employment statuses of not employed, other, and self-employed, while higher average scores correspond with retired, part-time and full-time.
It was surprising to see that being a homeowner had no affect on Prosper score.
Borrower rate tends to decrease as credit score increases, which makes sense. I was also very interested in the plot between borrower rate and monthly loan payment. There were clear segments that showed sectional linear relationships between borrower rate and monthly loan payment. I’m wondering if the distinction of these sections corresponds to any other variables, or if the linear relationships are affected by another variable.
The strongest bivariate relationship was between borrower rate and Prosper score. Credit score also had a fairly strong correlation with borrower rate.
First I want to investigate the relationship above between borrower rate and monthly loan payment against the categorical variables and against Prosper score. This relationship really caught my eye and I want to see if there is more to learn about it.
It’s hard to pull any trends out of the data here. Most of the data contains borrowers who are employed or full-time. For loans with low borrower rates and mid-to-high monthly loan payments, its most common for the borrower to be employed.
Most loans are current, completed, or defaulted, but its hard to see any obvious trends.
Again, there is no obvious relationship here - homeowners may be slightly more likely to have higher monthly payments and lower borrower rates.
Here there is definitely a relationship, with borrower rate increasing as the Prosper score decreases while holding the monthly loan payment constant. It seems that the linear relationship pieces of the plot noted previously seem to correspond to decreases in Prosper score.
Let us move on and look at Prosper score and borrower rate against the categorical variables.
There are no new trends that can be observed from these plots.
Next I want to look at borrower rate and credit score against the categorical variables and also against Prosper score.
Low credit scores tend to correspond to defaulted loans as well as employment statuses of not employed or no answer.
High credit scores tend to correspond to homeowners, while low credit scores correspond to non-homeowners.
Borrowers with a higher Prosper score tend to have lower borrower rates, even if we hold credit score constant.
At first I had wanted to start out by exploring the Prosper score, but the data exploration has made me more interested in predicting the borrower rate. The plots colored by Prosper score show the moderately strong relationships borrower rate has with other variables in the dataset. I want to create a linear model to see how well we can predict borrower rate.
##
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperScore, data = loans_subset_CreditScore)
## m2: lm(formula = BorrowerRate ~ ProsperScore + CreditScore, data = loans_subset_CreditScore)
## m3: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + MonthlyLoanPayment,
## data = loans_subset_CreditScore)
##
## ======================================================================
## m1 m2 m3
## ----------------------------------------------------------------------
## (Intercept) 0.317*** 0.648*** 0.630***
## (0.001) (0.003) (0.003)
## ProsperScore -0.020*** -0.017*** -0.016***
## (0.000) (0.000) (0.000)
## CreditScore -0.000*** -0.000***
## (0.000) (0.000)
## MonthlyLoanPayment -0.000***
## (0.000)
## ----------------------------------------------------------------------
## R-squared 0.421 0.504 0.535
## adj. R-squared 0.421 0.504 0.535
## sigma 0.057 0.052 0.051
## F 61498.813 43106.032 32531.390
## p 0.000 0.000 0.000
## Log-likelihood 122853.694 129468.834 132199.794
## Deviance 272.796 233.352 218.782
## AIC -245701.389 -258929.668 -264389.588
## BIC -245673.348 -258892.280 -264342.853
## N 84714 84714 84714
## ======================================================================
The model accounts for 53.5% of the variance in the borrower rate for a loan.
Holding monthly loan payments constant, higher Prosper scores generally equate to lower borrower rates. The same relationship can be seen when we hold credit scores constant.
These plots along with the correlations seen in the bivariate plots section led me to create a linear regression model with these variables.
I was mostly surprised to see the absence of any obvious trends across loan status, employment status, and homeownership (other than credit score).
For the amount of variables used in this exploration (compared to the 81 overall variables in the full data set), I think the linear model above is fairly good. With the complexities of all of the factors that go into a loan interest rate, I was surprised to see that Prosper score, credit score, and monthly loan payment alone account for 53.5% of the variance in the borrower rate.
The model obviously has its limitations since there are many other variables that likely affect affect borrower rate in the full data set (and outside it). The date the loan was taken out, the borrower’s state of residence, the type of loan, and many other variables likely have effects on the borrower rate. This is definitely not a perfect model, but I do find it relevant to see that three variables alone can account for over half of the variance in a loan rate.
The borrower’s interest rate distribution is fairly normal, but contains an interesting peak around 0.31. The loan company may approve some ‘riskier’ borrowers for loans and give them higher interest rates in an attempt to make more money.
Borrower rate and monthly loan payment have a very interesting relationship as seen here. There are clear linear trends within the plot. At lower monthly loan payments, the borrower rate can rise quickly for the same loan payment. As the monthly loan payments go up, borrower rate and loan payment increase together.
Prosper score gives a better understanding to the relationship between borrower rate and monthly loan payment. Keeping the monthly loan payment constant, the borrower rate increases as Prosper score decreases (indicating a higher risk borrower). As the monthly loan payment increases, the loan company is taking on a higher risk if the borrower does not pay the money back since they will lose more money. For this reason, the loan company increases both the borrower rate and monthly loan payment as the Prosper score decreases. This way, the loan company is able to earn more interest from the borrower if they do pay the loan back. The extra revenue from interest makes it worth it for them to approve the loan with a higher loan payment for riskier borrowers.
The full dataset used here contains almost 114,000 loans with 81 variables. There is a ton of information in the dataset, and in this exploration I decided to focus on 13 variables that peaked my interest.
I began by exploring the individual distributions of the variables in the dataset. Then, I looked at relationships between the variables and further explored the strongest trends. The exploration culminated in an interest in creating a linear model to predict borrower rate based on three of the variables.
Although I started off most interested in Prosper score, the trends among the variables I chose to explore made borrower rate the more intriguing variable. By choosing only 13 variables to explore from the full dataset, I absolutely introduced some limitations to the model. There are many other factors that I did not explore relating to the borrower rate, which is the reason the model only accounts for 53.5% of the variance in borrower rate.
Other limitations include the fact that this data all comes from Prosper. The relationships and models that come from this dataset may be totally different for loans at another company. The dataset also hasn’t been updated since 2014. Loan rates change frequently and so this data may not properly represent the current state of loans.
As mentioned above, there are 81 variables in the full dataset for each loan. I’d be interested in exploring the other variables in depth as well to see which other variables may contribute to borrower rate. With the full dataset, I also think a model could be built to predict the Prosper score. It would be interesting to see which factors go into the generation of a Prosper score and if a model could be built that could accurately predict what a borrower’s Prosper score would be.